mean_absolute_error (MAE)#

Mean Absolute Error (MAE) measures the average absolute difference between targets and predictions.

If \(y\) is the true target and \(\hat{y}\) is the prediction, MAE answers:

“On average, how many units am I off?”

It is one of the most interpretable regression metrics because it is expressed in the same units as the target.


Learning goals#

  • Define MAE precisely (with math)

  • Build intuition with plots

  • Implement MAE from scratch in NumPy (including weights / multi-output)

  • Use MAE as an optimization objective (L1 / median regression) with subgradient descent

  • Understand pros/cons and when to use MAE

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Definition#

For \(n\) samples, MAE is:

\[ \mathrm{MAE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \]

Let the residual be \(r_i = \hat{y}_i - y_i\). Then MAE is the mean of \(|r_i|\).

Equivalently (vector form for 1D targets): \(\mathrm{MAE}(y, \hat{y}) = \frac{1}{n}\lVert y - \hat{y}\rVert_1\).
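As a quick sanity check (a minimal sketch with made-up numbers), the elementwise definition and the L1-norm form give the same value:

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mae_mean = np.mean(np.abs(y - y_hat))                  # elementwise definition
mae_norm = np.linalg.norm(y - y_hat, ord=1) / len(y)   # (1/n) * ||y - y_hat||_1

mae_mean, mae_norm  # both 0.5
```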

Weighted MAE#

With non-negative sample weights \(w_i\):

\[ \mathrm{MAE}_w(y, \hat{y}) = \frac{\sum_{i=1}^{n} w_i |y_i - \hat{y}_i|}{\sum_{i=1}^{n} w_i} \]
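In NumPy this weighted mean is exactly what np.average computes (a sketch with hypothetical values):

```python
import numpy as np

y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([1.0, 1.0, 4.0])
w = np.array([1.0, 1.0, 2.0])  # the last sample counts double

# np.average implements sum(w * |e|) / sum(w) directly
mae_w = np.average(np.abs(y_true - y_pred), weights=w)
mae_w  # (1*1 + 1*1 + 2*4) / 4 = 2.5
```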

Multi-output targets#

If \(y \in \mathbb{R}^{n \times m}\) (multiple outputs), you typically compute MAE per output and then aggregate (e.g. uniform average).
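Concretely (a small sketch with hypothetical 2-output arrays), the per-output MAE is a mean of absolute errors along the sample axis, which can then be averaged across outputs:

```python
import numpy as np

y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # shape (n=3, m=2)
y_pred = np.array([[1.5,  9.0], [2.0, 21.0], [2.0, 30.0]])

per_output = np.abs(y_true - y_pred).mean(axis=0)  # one MAE per output column
uniform = per_output.mean()                        # uniform average across outputs

per_output, uniform
```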

2) Intuition: MAE is “average distance”#

Absolute error is a distance on the number line:

  • it never cancels out (negative errors do not offset positive ones)

  • it is linear: doubling an error doubles its contribution

  • it’s in the same unit as \(y\) (e.g. dollars, °C, minutes)

A useful mental model is “typical miss size”.

# A tiny example
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_pred - y_true
abs_errors = np.abs(residuals)

mae_manual = abs_errors.mean()
mae_sklearn = mean_absolute_error(y_true, y_pred)

residuals, abs_errors, mae_manual, mae_sklearn
(array([-0.5,  0.5,  0. ,  1. ]), array([0.5, 0.5, 0. , 1. ]), 0.5, 0.5)

Visual: each absolute error is a segment length#

For each sample \(i\), MAE takes the length of the segment between \(y_i\) and \(\hat{y}_i\), then averages those lengths.

idx = np.arange(len(y_true))
mid = 0.5 * (y_true + y_pred)

fig = go.Figure()
fig.add_trace(go.Scatter(x=idx, y=y_true, mode="markers", name="y_true", marker=dict(size=10)))
fig.add_trace(go.Scatter(x=idx, y=y_pred, mode="markers", name="y_pred", marker=dict(size=10)))

for i in idx:
    fig.add_shape(
        type="line",
        x0=int(i), x1=int(i),
        y0=float(y_true[i]), y1=float(y_pred[i]),
        line=dict(color="gray", width=2),
    )

fig.add_trace(
    go.Scatter(
        x=idx,
        y=mid,
        mode="text",
        text=[f"|e|={e:.2f}" for e in abs_errors],
        showlegend=False,
        textposition="middle right",
    )
)

fig.update_layout(
    title=f"Absolute errors per sample (MAE = {mae_manual:.3f})",
    xaxis_title="sample index i",
    yaxis_title="value",
)
fig.show()

3) MAE vs MSE: different penalty shapes#

Both MAE and MSE summarize residuals \(r = \hat{y} - y\), but they penalize them differently:

  • MAE uses \(|r|\) (linear penalty)

  • MSE uses \(r^2\) (quadratic penalty)

So MSE puts much more weight on large errors, while MAE is more robust to outliers.

Subgradient of the absolute value#

The absolute value is not differentiable at 0, but it has a subgradient:

\[\begin{split} \frac{d}{dr}|r| = \begin{cases} +1 & r > 0 \\ -1 & r < 0 \\ \text{any value in }[-1, 1] & r = 0 \end{cases} \end{split}\]

In code we typically use np.sign(r), which returns 0 for exactly-zero residuals — a valid choice from \([-1, 1]\).

r = np.linspace(-5, 5, 401)
l1 = np.abs(r)
l2 = r**2

# Huber (smooth L1) is often used as a differentiable alternative
delta = 1.0
huber = np.where(np.abs(r) <= delta, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

fig = go.Figure()
fig.add_trace(go.Scatter(x=r, y=l1, name="MAE pointwise: |r|", line=dict(width=3)))
fig.add_trace(go.Scatter(x=r, y=l2, name="MSE pointwise: r²", line=dict(width=3)))
fig.add_trace(go.Scatter(x=r, y=huber, name="Huber (smooth L1)", line=dict(width=3, dash="dot")))
fig.update_layout(
    title="Pointwise loss vs residual r",
    xaxis_title="residual r = ŷ - y",
    yaxis_title="loss for a single sample",
)
fig.show()

Outlier sensitivity: one bad point#

Suppose \(n-1\) predictions are perfect (\(r=0\)) and one sample has residual \(m\). Then:

\[ \mathrm{MAE} = \frac{|m|}{n} \qquad\qquad \mathrm{MSE} = \frac{m^2}{n} \]

Even though both are averaged, MSE still grows quadratically in \(|m|\), so a single large error can dominate it.

n_samples = 50
m = np.linspace(0, 20, 401)

mae_one = m / n_samples
mse_one = (m**2) / n_samples

fig = go.Figure()
fig.add_trace(go.Scatter(x=m, y=mae_one, name="MAE (one outlier)", line=dict(width=3)))
fig.add_trace(
    go.Scatter(
        x=m,
        y=mse_one,
        name="MSE (one outlier)",
        yaxis="y2",
        line=dict(width=3, dash="dash"),
    )
)
fig.update_layout(
    title=f"Effect of 1 outlier among n={n_samples} samples",
    xaxis_title="outlier magnitude |m|",
    yaxis_title="MAE (same units as y)",
    yaxis2=dict(title="MSE (squared units)", overlaying="y", side="right"),
)
fig.show()

4) A key property: MAE is minimized by the median#

Suppose your model can only output a constant prediction \(c\) (no features). The objective becomes:

\[ J(c) = \frac{1}{n}\sum_{i=1}^{n} |y_i - c| \]

The value(s) of \(c\) that minimize \(J(c)\) are the median(s) of \(y\).

Intuition: moving \(c\) to the right by a small \(\varepsilon\) increases the loss by \(\varepsilon\) for every point left of \(c\) and decreases it by \(\varepsilon\) for every point right of \(c\). At the optimum those counts balance — that’s the median.

y = rng.normal(loc=0.0, scale=1.0, size=250)
y[:6] += 8  # a few big outliers

c_grid = np.linspace(y.min() - 1, y.max() + 1, 500)
mae_c = np.mean(np.abs(y[:, None] - c_grid[None, :]), axis=0)

c_argmin = float(c_grid[np.argmin(mae_c)])
y_median = float(np.median(y))
y_mean = float(y.mean())

fig = go.Figure()
fig.add_trace(go.Scatter(x=c_grid, y=mae_c, name="MAE(c)", line=dict(width=3)))
fig.add_vline(x=y_median, line=dict(color="green", dash="dash"), annotation_text="median(y)")
fig.add_vline(x=y_mean, line=dict(color="red", dash="dot"), annotation_text="mean(y)")
fig.add_vline(x=c_argmin, line=dict(color="black"), annotation_text="argmin")
fig.update_layout(
    title="For a constant predictor, MAE is minimized at the median",
    xaxis_title="constant prediction c",
    yaxis_title="MAE(c)",
)
fig.show()

y_mean, y_median, c_argmin
(0.14334962923664443, -0.06675038086235976, -0.07859386281040903)

5) NumPy implementation from scratch#

Below is a small NumPy implementation that matches scikit-learn’s behavior for:

  • sample_weight (weighted average over samples)

  • multi-output targets with multioutput:

    • "raw_values" (per-output MAE)

    • "uniform_average" (average of per-output MAEs)

    • an array of output weights

def mean_absolute_error_np(y_true, y_pred, *, sample_weight=None, multioutput="uniform_average"):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")

    if y_true.ndim == 1:
        abs_err = np.abs(y_true - y_pred)
        if sample_weight is None:
            return float(abs_err.mean())

        w = np.asarray(sample_weight)
        if w.shape != abs_err.shape:
            raise ValueError(f"sample_weight must have shape {abs_err.shape}, got {w.shape}")
        if np.any(w < 0):
            raise ValueError("sample_weight must be non-negative")
        return float(np.sum(w * abs_err) / np.sum(w))

    if y_true.ndim != 2:
        raise ValueError("y_true must be 1D or 2D")

    abs_err = np.abs(y_true - y_pred)  # (n_samples, n_outputs)

    if sample_weight is None:
        output_errors = abs_err.mean(axis=0)
    else:
        w = np.asarray(sample_weight)
        if w.ndim != 1 or w.shape[0] != abs_err.shape[0]:
            raise ValueError(
                f"sample_weight must have shape (n_samples,), got {w.shape} for n_samples={abs_err.shape[0]}"
            )
        if np.any(w < 0):
            raise ValueError("sample_weight must be non-negative")
        output_errors = np.sum(abs_err * w[:, None], axis=0) / np.sum(w)

    if multioutput == "raw_values":
        return output_errors
    if multioutput == "uniform_average":
        return float(output_errors.mean())

    weights = np.asarray(multioutput)
    if weights.shape != (abs_err.shape[1],):
        raise ValueError(
            f"multioutput weights must have shape (n_outputs,), got {weights.shape} for n_outputs={abs_err.shape[1]}"
        )
    if np.any(weights < 0):
        raise ValueError("multioutput weights must be non-negative")
    return float(np.average(output_errors, weights=weights))

# Quick check vs scikit-learn
y_true_2d = rng.normal(size=(12, 3))
y_pred_2d = y_true_2d + rng.normal(scale=0.2, size=(12, 3))
w = rng.uniform(0.5, 2.0, size=12)

ours = mean_absolute_error_np(y_true_2d, y_pred_2d, sample_weight=w, multioutput="raw_values")
sk = mean_absolute_error(y_true_2d, y_pred_2d, sample_weight=w, multioutput="raw_values")

ours, sk, np.allclose(ours, sk)
(array([0.1673, 0.1563, 0.1535]), array([0.1673, 0.1563, 0.1535]), True)

6) Using MAE to fit a model: L1 regression (median regression)#

If we use MAE as the training objective, we get L1 regression. For a linear model:

\[ \hat{y} = Xw + b \]

the MAE objective is:

\[ J(w, b) = \frac{1}{n}\sum_{i=1}^{n} |(Xw + b)_i - y_i| \]

This is convex, but not differentiable everywhere. A common low-level optimizer is subgradient descent.

Let \(r = Xw + b - y\) and \(s = \mathrm{sign}(r)\) (using 0 when \(r=0\)). One valid subgradient is:

\[ \nabla_w J = \frac{1}{n} X^\top s\qquad\qquad \frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^n s_i \]

Statistical interpretation: minimizing MAE corresponds to maximum likelihood under Laplace (double exponential) noise, and it targets the conditional median (robust to outliers).
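The Laplace connection can be checked numerically (a sketch with synthetic data, fixing the Laplace scale \(b=1\) as an assumption): the negative log-likelihood of a constant prediction \(c\) is \(n\log(2b) + \frac{1}{b}\sum_i |y_i - c|\), an affine function of the MAE, so both have the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.laplace(loc=0.0, scale=1.0, size=1000)
c_grid = np.linspace(-2, 2, 401)

b = 1.0  # assumed Laplace scale
abs_dev = np.abs(y[:, None] - c_grid[None, :])  # (n, len(c_grid))

# NLL(c) = n*log(2b) + (1/b) * sum_i |y_i - c|  -- affine in MAE(c)
nll = len(y) * np.log(2 * b) + abs_dev.sum(axis=0) / b
mae = abs_dev.mean(axis=0)

c_grid[np.argmin(nll)], c_grid[np.argmin(mae)]  # same minimizer, near median(y)
```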

# Synthetic regression data with outliers
n = 220
x = rng.uniform(0, 10, size=n)
X = x[:, None]

beta0_true = 1.0
beta1_true = 2.0

# Laplace noise matches the MAE/L1 modeling assumption
noise = rng.laplace(loc=0.0, scale=1.0, size=n)
y = beta0_true + beta1_true * x + noise

# Inject a few large outliers
outlier_idx = rng.choice(n, size=7, replace=False)
y[outlier_idx] += rng.normal(loc=25.0, scale=5.0, size=outlier_idx.size)

outlier_idx[:5], x.min(), x.max()
(array([131,  65,  31,  24, 118]), 0.054298341287004614, 9.991047304742386)
# Fit ordinary least squares (OLS): minimizes MSE, not MAE
ols = LinearRegression().fit(X, y)
y_pred_ols = ols.predict(X)

mae_ols = mean_absolute_error(y, y_pred_ols)
mse_ols = mean_squared_error(y, y_pred_ols)

(ols.intercept_, ols.coef_[0]), (mae_ols, mse_ols)
((1.5939862944862195, 2.034722899043242),
 (1.9884324902995665, 21.556270780903915))
def fit_linear_regression_mae_subgradient(X, y, *, lr0=0.8, n_iters=3000):
    """Minimize MAE for y_hat = X @ w + b using subgradient descent.

    Uses a decaying learning rate: lr_t = lr0 / sqrt(t+1).
    """
    X = np.asarray(X)
    y = np.asarray(y)
    n_samples, n_features = X.shape

    w = np.zeros(n_features)
    b = float(np.median(y))  # good starting point when w=0
    history = np.empty(n_iters)

    for t in range(n_iters):
        y_hat = X @ w + b
        r = y_hat - y
        s = np.sign(r)  # subgradient of |r|

        grad_w = (X.T @ s) / n_samples
        grad_b = s.mean()

        lr = lr0 / np.sqrt(t + 1)
        w -= lr * grad_w
        b -= lr * grad_b

        history[t] = np.mean(np.abs(r))

    return w, b, history

# Feature scaling improves optimization stability
x_mean = float(x.mean())
x_std = float(x.std())
X_scaled = ((x - x_mean) / x_std)[:, None]

w_scaled, b_scaled, hist = fit_linear_regression_mae_subgradient(X_scaled, y)

# Convert back to original x scale
w_mae = w_scaled[0] / x_std
b_mae = b_scaled - w_scaled[0] * x_mean / x_std

y_pred_mae = w_mae * x + b_mae

mae_mae = mean_absolute_error(y, y_pred_mae)
mse_mae = mean_squared_error(y, y_pred_mae)

(b_mae, w_mae), (mae_mae, mse_mae)
((1.0142419496584196, 2.001810613820674),
 (1.8120687855249078, 22.118168492697624))
fig = px.line(
    y=hist,
    title="Subgradient descent: MAE objective vs iteration",
    labels={"index": "iteration", "y": "train MAE"},
)
fig.show()
x_line = np.linspace(x.min(), x.max(), 250)

y_line_true = beta0_true + beta1_true * x_line
y_line_ols = ols.intercept_ + ols.coef_[0] * x_line
y_line_mae = b_mae + w_mae * x_line

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=x, y=y, mode="markers", name="data",
        marker=dict(size=7, color="rgba(0,0,0,0.55)"),
    )
)
fig.add_trace(
    go.Scatter(
        x=x[outlier_idx], y=y[outlier_idx], mode="markers", name="outliers",
        marker=dict(size=10, color="crimson", symbol="x"),
    )
)
fig.add_trace(
    go.Scatter(
        x=x_line, y=y_line_true, mode="lines", name="true line",
        line=dict(color="green", dash="dash"),
    )
)
fig.add_trace(
    go.Scatter(
        x=x_line, y=y_line_ols, mode="lines",
        name=f"OLS (MSE) fit | MAE={mae_ols:.2f}",
        line=dict(width=3),
    )
)
fig.add_trace(
    go.Scatter(
        x=x_line, y=y_line_mae, mode="lines",
        name=f"L1 (MAE) fit | MAE={mae_mae:.2f}",
        line=dict(width=3),
    )
)
fig.update_layout(
    title="Outliers pull OLS more than L1/MAE regression",
    xaxis_title="x",
    yaxis_title="y",
)
fig.show()

7) Practical usage (scikit-learn)#

MAE is typically used as an evaluation metric on validation/test sets and for model selection.

In scikit-learn:

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_true, y_pred)

For cross-validation, scikit-learn uses a “bigger is better” convention, so it exposes MAE as negative MAE:

cross_val_score(model, X, y, scoring="neg_mean_absolute_error")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_test_pred = model.predict(X_test)

test_mae = mean_absolute_error(y_test, y_test_pred)
test_mae
1.9732338388900619
scores = cross_val_score(LinearRegression(), X, y, scoring="neg_mean_absolute_error", cv=5)
mae_scores = -scores

mae_scores, mae_scores.mean()
(array([2.1155, 2.1505, 2.1875, 2.    , 1.6525]), 2.021217652798245)

Pros, cons, and when to use MAE#

Pros#

  • Interpretable: same units as the target (easy to communicate)

  • Robust-ish to outliers: large errors grow linearly, not quadratically

  • Median-targeting: minimizing MAE targets the conditional median (useful when outliers/heavy tails exist)

Cons#

  • Not differentiable at 0: requires subgradients or smooth approximations (Huber / smooth L1)

  • Doesn’t emphasize large errors: if big misses are especially costly, MAE may under-penalize them

  • Scale-dependent: MAE values can’t be compared across targets with different units/scales

Good use cases#

  • When the cost of an error grows roughly linearly with its magnitude

  • When your data has outliers or heavy-tailed noise and you want a stable “typical error” number

  • When you care about the median behavior rather than the mean


Common pitfalls / diagnostics#

  • If \(y\) has a wide range, MAE can look “large” even for good models; compare to a baseline (e.g. predict median).

  • MAE alone hides error distribution; pair it with a plot of residuals or a percentile-based metric.

  • If your business cost is asymmetric (over-prediction vs under-prediction differ), consider pinball loss / quantile regression instead of plain MAE.
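To make the baseline point concrete, here is a minimal sketch (synthetic data, and DummyRegressor chosen here just for convenience): a model's MAE is only meaningful relative to a constant median predictor.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.laplace(scale=1.0, size=200)
X = x[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline: always predict median(y_train)
baseline = DummyRegressor(strategy="median").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_model = mean_absolute_error(y_te, model.predict(X_te))

mae_baseline, mae_model  # the model should beat the median baseline by a wide margin
```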


Exercises#

  1. Prove that the minimizer of \(\sum_i |y_i - c|\) is any median of \(y\).

  2. Modify mean_absolute_error_np to support an axis argument.

  3. Compare MAE vs RMSE on the same dataset as you add more outliers. What changes first?


References#

  • scikit-learn: sklearn.metrics.mean_absolute_error

  • Robust regression / L1 loss: connection to Laplace noise and median regression